A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data
Abstract
Efficient estimation of the moments and Shannon entropy of data streams is an important task in modern machine learning and data mining. To estimate the Shannon entropy, it suffices to accurately estimate the α-th moment with ∆ = |1 − α| ≈ 0. To guarantee that the error of the estimated Shannon entropy is within a ν-additive factor, the method of symmetric stable random projections requires O(1/(ν²∆²)) samples, which is extremely expensive. The first paper on Compressed Counting (CC) (Li, 2009a), based on skewed-stable random projections, supplies a substantial improvement by reducing the sample complexity to O(1/(ν²∆)), which is still expensive. The follow-up work (Li, 2009b) provides a practical algorithm, which is, however, difficult to analyze theoretically. In this paper, we propose a new accurate algorithm for Compressed Counting, whose sample complexity is only O(1/ν²) for ν-additive Shannon entropy estimation. The constant factor for this bound is merely about 6. In addition, we prove that our algorithm achieves an upper bound of the Fisher information and in fact it is close to 100% statistically optimal. An empirical study is conducted to verify the accuracy of our algorithm.
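The link between the α-th moment and the Shannon entropy can be illustrated with a minimal sketch. The abstract does not say which function of the moment is used; the sketch below assumes the Tsallis entropy, which approaches the Shannon entropy as α → 1. It works on exact moments of a small, made-up count vector and does not implement Compressed Counting or any random projection; the function names and the sample counts are hypothetical.

```python
import math


def frequency_moment(counts, alpha):
    """Exact alpha-th frequency moment: sum of c**alpha over nonzero counts."""
    return sum(c ** alpha for c in counts if c > 0)


def shannon_entropy(counts):
    """Exact Shannon entropy (in nats) of the normalized counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def tsallis_entropy(counts, alpha):
    """Tsallis entropy written in terms of the alpha-th and first moments.

    Assumption: this is one common moment-based surrogate for Shannon
    entropy, not necessarily the one used in the paper.
    """
    total = sum(counts)  # first frequency moment
    f_alpha = frequency_moment(counts, alpha)
    return (1.0 - f_alpha / total ** alpha) / (alpha - 1.0)


# Hypothetical stream summary: four items with these total counts.
counts = [50, 30, 15, 5]
h = shannon_entropy(counts)

for delta in (0.1, 0.01, 0.001):
    t = tsallis_entropy(counts, 1.0 - delta)
    print(f"delta={delta}: Tsallis={t:.5f}, Shannon={h:.5f}, gap={abs(t - h):.5f}")
```

The gap shrinks roughly in proportion to ∆, but the division by α − 1 also magnifies any error in the estimated moment by a factor of 1/∆, which is consistent with the dependence on ∆ in the sample complexities quoted above.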
Similar resources
Improving Compressed Counting
Compressed Counting (CC) [22] was recently proposed for estimating the αth frequency moments of data streams, where 0 < α ≤ 2. CC can be used for estimating Shannon entropy, which can be approximated by certain functions of the αth frequency moments as α → 1. Monitoring Shannon entropy for anomaly detection (e.g., DDoS attacks) in large networks is an important task. This paper presents a new a...
Entropy Estimations Using Correlated Symmetric Stable Random Projections
Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is...
Estimating Entropy of Data Streams Using Compressed Counting
The Shannon entropy is a widely used summary statistic, for example, network traffic measurement, anomaly detection, neural computations, spike trains, etc. This study focuses on estimating Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, which are both functions of the αth frequency moments and approach Shannon entropy a...
On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting
The long-standing problem of Shannon entropy estimation in data streams (assuming the strict Turnstile model) is now an easy task by using the technique proposed in this paper. Essentially speaking, in order to estimate the Shannon entropy with a guaranteed ν-additive accuracy, it suffices to estimate the αth frequency moment, where α = 1−∆, with a guaranteed ε-multiplicative accuracy,...
A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting
Compressed Counting (CC) was recently proposed for approximating the αth frequency moments of data streams, for 0 < α ≤ 2. Under the relaxed strict-Turnstile model, CC dramatically improves the standard algorithm based on symmetric stable random projections, especially as α → 1. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measure...